Ruminations about the Introduction to Open Data Science (IODS) 2021
The course Introduction to Open Data Science (IODS) feels like a good introduction to the R-Rstudio-git pipeline advocating for reproducible well-documented research.
Challenges may arise due to
I hope the course will generate mindsets favoring open research equipped with technical skills to produce good research output.
I hope to learn not only technical skills but also ways to endorse open data science.
I found the course in Sisu. Apparently it is also found in the MOOC platform.
date()
## [1] "Thu Dec 2 10:18:18 2021"
This week, we are analyzing and modeling a survey data concerning the approaches to learning. Description of the data can be found here and here.
The data has been preprocessed to include variables on gender, age, attitude, exam points, and scores on deep, surface and strategic questions.
Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.
# read in the data
learning2014 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")
dim(learning2014)
## [1] 166 7
There are 166 rows and 7 columns.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The variables are mostly numeric. Gender is of character value, change it to factor.
# change to factor
learning2014$gender <- as.factor(learning2014$gender)
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Seems like most of the students are female, age spans 17-55, exam points are between 7-33 and the question variables are capped at 5 (from documentation, on a scale 1-5).
Using GGlally::ggpairs, we can plot the distribution of the variables, correlation (with significance), boxplots and scatterplots stratified by the gender.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 3)))
p
We see that
points and attitudesurf and deep questionssurf and attitudesurf and stra questionsFor pedagogical reasons, let’s fit a simple multivariable linear regression first with four (instead of three; choosing three would not change the results) explanatory variables, ie. regress exam points on other variables. Choose variables based on their absolute correlation with the exam points.
sort(abs(cor(learning2014[, -1])[, c("points")]))
## deep age surf stra attitude points
## 0.01014849 0.09319032 0.14435642 0.14612247 0.43652453 1.00000000
model <- lm(points ~ attitude + stra + surf + age, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8639 -3.2849 0.3086 3.4548 10.7028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.70644 3.96234 3.459 0.000694 ***
## attitude 3.38964 0.57041 5.942 1.68e-08 ***
## stra 0.93123 0.53986 1.725 0.086459 .
## surf -0.76565 0.80258 -0.954 0.341524
## age -0.09466 0.05346 -1.771 0.078502 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.262 on 161 degrees of freedom
## Multiple R-squared: 0.2226, Adjusted R-squared: 0.2033
## F-statistic: 11.52 on 4 and 161 DF, p-value: 2.987e-08
From the summary, the intercept is very significantly non-zero based on a t-test (H_0: intercept is zero, H_1: intercept is not zero) although the result is not very meaningful as e.g. age cannot be zero in this context.
The attitude variable has a very significant effect on the exam points (p-value about zero for a t-test for H_0: attitude has no effect on the slope when the other variables are kept constant, H_1: attitude has an effect). Variables stra and age are weakly significant or non-significant depending on the nomenclature (same test). Variable surf is not significant, let’s remove it and refit.
model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Now all explanatory variables are significant with a 0.10 significance cutoff.
The F-test is significant, some of the variables are therefore associated with the response variable (have non-zero coefficient).
From the summary above, we see that as attitude increases by one, exam points increase by about 3.48 when other variables are kept constant. For stra, the value is about 1.00. Age seems to have a smaller effect, with an additional year decreasing exam points by around 0.09 if other variables do not change.
Multiple R-squared is 0.2182, meaning that the three explanatory variables account for approximately 22% variation in the exam points. This seems quite low and we should be careful when e.g. predicting exam points for new students with this model. The adjusted R-squared considers the number of explanatory variables and has here a similar, a bit smaller, value to the unadjusted R-squared.
Linear models carry assumptions that need to be verified. We use diagnostic plots for visual assessment.
# residuals vs fitted values
plot(model, which = 1)
There doesn’t seem to be a clear pattern (maybe visually somewhat of a funnel with decreasing variance but Breusch-Pagan test car::ncvTest(model) doesn’t detect heteroskedasticity) in the residuals versus fitted values in the sense that the variance in residuals is similar across the fitted values. Also, the variation is approximately symmetrical around zero. Constant variance assumption appears to be met. Linear model appears to be OK for the data.
Three outliers (set by id.n in plot.lm) are detected (rows 35, 56 and 145), let’s check them.
print(learning2014[c(35, 56, 145), ])
## gender age attitude deep stra surf points
## 35 M 20 4.2 4.500000 3.250 1.583333 10
## 56 F 45 3.8 3.000000 3.125 3.250000 9
## 145 F 21 3.5 3.916667 3.875 3.500000 7
annotate <- c(35, 56, 145, 4, 2) # 4 and 2 for leverage below
# make the plot tighter
op <- par(mfrow=c(3, 1),
mar = c(4, 4, 1, 1),
mgp = c(2, 1, 0))
parameters <- c("attitude", "stra", "age")
for (param in parameters) {
print(param)
plot(learning2014[, param], learning2014$points,
xlab = param, ylab = "points")
text(learning2014[annotate, param],
learning2014[annotate, "points"],
annotate, pos = 4, col = "red")
}
## [1] "attitude"
## [1] "stra"
## [1] "age"
par(op) # return original par
It appears that here are the students that got low exam points despite having a high attitude value.
Plot boxplot for assessment of symmetry of the residual distribution.
boxplot(resid(model))
Does not seem too bad, but is not perfectly symmetric. QQ-plot will aid in deciding normality.
# qq-plot
plot(model, which = 2)
Standardized residuals follow the linear pattern quite reasonably with the same outlier exceptions as above. Hence, there is no strong reason to suspect deviation from normality in the distribution of the residuals.
# residuals vs. leverage
plot(model, which = 5)
Observations 2, 4 and 56 have been marked as outliers, they have relatively high influence of the regression line compared to other observations. We can rerun the model without these to see if the multiple R-squared is increased.
Cook’s distance is also low, supporting the notion that there are no outliers having a great effect on the linear fit.
max(cooks.distance(model))
## [1] 0.1202162
summary(lm(points ~ attitude + stra + age, data = learning2014[-c(2, 4, 56), ]))
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014[-c(2,
## 4, 56), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.822 -3.520 0.348 3.617 10.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.38472 2.58566 3.243 0.00144 **
## attitude 3.64594 0.53695 6.790 2.11e-10 ***
## stra 0.84219 0.51066 1.649 0.10108
## age 0.01968 0.05677 0.347 0.72930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.008 on 159 degrees of freedom
## Multiple R-squared: 0.2419, Adjusted R-squared: 0.2276
## F-statistic: 16.91 on 3 and 159 DF, p-value: 1.394e-09
Multiple R-square is indeed a bit better. However, removal of the outliers would have to be justified from the data (e.g. are these students somehow different).
library(car)
## Loading required package: carData
vif(model)
## attitude stra age
## 1.004083 1.014233 1.010865
Variance inflation factors are less than 10 (textbook), so there is no concern for collinearity.
We can try automatic stepwise model selection by AIC criterion.
library(MASS)
model.full <- lm(points ~ gender + age + attitude + deep + stra + surf,
data = learning2014)
stepAIC(model.full, direction = "both")
## Start: AIC=558.36
## points ~ gender + age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - gender 1 0.11 4408.5 556.36
## - surf 1 47.79 4456.1 558.15
## - deep 1 49.00 4457.3 558.19
## <none> 4408.3 558.36
## - stra 1 83.78 4492.1 559.48
## - age 1 87.88 4496.2 559.63
## - attitude 1 919.82 5328.2 587.82
##
## Step: AIC=556.36
## points ~ age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 47.69 4456.1 556.15
## - deep 1 49.11 4457.6 556.20
## <none> 4408.5 556.36
## - stra 1 88.35 4496.8 557.66
## - age 1 90.16 4498.6 557.72
## + gender 1 0.11 4408.3 558.36
## - attitude 1 999.18 5407.6 588.27
##
## Step: AIC=556.15
## points ~ age + attitude + deep + stra
##
## Df Sum of Sq RSS AIC
## - deep 1 26.61 4482.8 555.14
## <none> 4456.1 556.15
## + surf 1 47.69 4408.5 556.36
## - age 1 75.36 4531.5 556.93
## - stra 1 106.07 4562.2 558.05
## + gender 1 0.02 4456.1 558.15
## - attitude 1 1084.38 5540.5 590.30
##
## Step: AIC=555.14
## points ~ age + attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4482.8 555.14
## - age 1 76.62 4559.4 555.95
## + deep 1 26.61 4456.1 556.15
## + surf 1 25.20 4457.6 556.20
## - stra 1 97.64 4580.4 556.71
## + gender 1 0.01 4482.8 557.14
## - attitude 1 1060.72 5543.5 588.39
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) age attitude stra
## 10.89543 -0.08822 3.48077 1.00371
The produced model is the same we derived earlier.
Finally, we can try to brute force all linear models of simple combination of explanatory variables.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
res <- olsrr::ols_step_all_possible(model.full)
res[order(res$adjr, decreasing = TRUE), ]
## Index N Predictors R-Square Adj. R-Square
## 62 57 5 age attitude deep stra surf 0.2311318189 0.207104688
## 33 22 3 age attitude stra 0.2181719645 0.203693668
## 52 42 4 age attitude deep stra 0.2228135019 0.203504521
## 54 43 4 age attitude stra surf 0.2225665343 0.203251417
## 63 63 6 gender age attitude deep stra surf 0.2311510113 0.202137842
## 43 44 4 gender age attitude stra 0.2181729919 0.198748718
## 57 58 5 gender age attitude deep stra 0.2228168858 0.198529914
## 59 59 5 gender age attitude stra surf 0.2226046106 0.198311005
## 53 45 4 age attitude deep surf 0.2157221484 0.196236984
## 56 46 4 attitude deep stra surf 0.2154077956 0.195914822
## 17 7 2 attitude stra 0.2048096617 0.195052725
## 38 23 3 attitude deep stra 0.2096692889 0.195033535
## 34 24 3 age attitude surf 0.2081990315 0.193536051
## 40 25 3 attitude stra surf 0.2074263400 0.192749050
## 58 60 5 gender age attitude deep surf 0.2165385715 0.192055402
## 12 8 2 age attitude 0.2011434849 0.191341564
## 61 61 5 gender attitude deep stra surf 0.2158241415 0.191318646
## 27 26 3 gender attitude stra 0.2050965812 0.190376147
## 48 47 4 gender attitude deep stra 0.2098633360 0.190232611
## 32 27 3 age attitude deep 0.2043132366 0.189578297
## 44 48 4 gender age attitude surf 0.2090640320 0.189413449
## 50 49 4 gender attitude stra surf 0.2079025033 0.188223062
## 39 28 3 attitude deep surf 0.2023788945 0.187608133
## 22 29 3 gender age attitude 0.2017804149 0.186998571
## 3 1 1 attitude 0.1905536672 0.185618019
## 18 9 2 attitude surf 0.1952865878 0.185412804
## 42 50 4 gender age attitude deep 0.2048827672 0.185128302
## 49 51 4 gender attitude deep surf 0.2040700385 0.184295381
## 16 10 2 attitude deep 0.1939911024 0.184101423
## 28 30 3 gender attitude surf 0.1970278452 0.182157990
## 8 11 2 gender attitude 0.1919357660 0.182020867
## 26 31 3 gender attitude deep 0.1952588802 0.180356267
## 47 52 4 gender age stra surf 0.0653293394 0.042107708
## 60 62 5 gender age deep stra surf 0.0707276010 0.041687839
## 37 32 3 age stra surf 0.0520482669 0.034493605
## 55 53 4 age deep stra surf 0.0568671481 0.033435276
## 24 33 3 gender age stra 0.0504456777 0.032861338
## 31 34 3 gender stra surf 0.0462022812 0.028539360
## 45 54 4 gender age deep stra 0.0514840047 0.027918390
## 51 55 4 gender deep stra surf 0.0510011757 0.027423565
## 21 12 2 stra surf 0.0363412029 0.024517169
## 25 35 3 gender age surf 0.0420448279 0.024304917
## 41 36 3 deep stra surf 0.0407134651 0.022948900
## 10 13 2 gender stra 0.0346692266 0.022824677
## 46 56 4 gender age deep surf 0.0462679619 0.022572756
## 15 14 2 age surf 0.0340077514 0.022155086
## 14 15 2 age stra 0.0331743748 0.021311484
## 36 37 3 age deep surf 0.0379369858 0.020121004
## 29 38 3 gender deep stra 0.0357520229 0.017895579
## 35 39 3 age deep stra 0.0336895603 0.015794923
## 5 2 1 stra 0.0213517768 0.015384410
## 6 3 1 surf 0.0208387755 0.014868280
## 11 16 2 gender surf 0.0267878521 0.014846599
## 30 40 3 gender deep surf 0.0306219089 0.012670463
## 20 17 2 deep surf 0.0244545065 0.012484623
## 19 18 2 deep stra 0.0219453518 0.009944681
## 7 19 2 gender age 0.0196556536 0.007626889
## 2 4 1 age 0.0086844365 0.002639829
## 1 5 1 gender 0.0086318623 0.002586935
## 23 41 3 gender age deep 0.0198419947 0.001690921
## 9 20 2 gender deep 0.0088743608 -0.003286690
## 13 21 2 age deep 0.0087454938 -0.003417138
## 4 6 1 deep 0.0001029919 -0.005993941
## Mallow's Cp
## 62 5.003969
## 33 3.684101
## 52 4.724219
## 54 4.775293
## 63 7.000000
## 43 5.683889
## 57 6.723519
## 59 6.767418
## 53 6.190730
## 56 6.255739
## 17 4.447461
## 38 5.442477
## 34 5.746530
## 40 5.906325
## 58 8.021891
## 12 5.205636
## 61 8.169637
## 27 6.388125
## 48 7.402347
## 32 6.550123
## 44 7.567646
## 50 7.807853
## 39 6.950150
## 22 7.073917
## 3 5.395638
## 18 6.416857
## 42 8.432342
## 49 8.600417
## 16 6.684767
## 28 8.056761
## 8 7.109816
## 26 8.422587
## 47 37.292359
## 60 38.175985
## 37 38.038920
## 55 39.042363
## 24 38.370340
## 31 39.247886
## 45 40.155611
## 51 40.255461
## 21 39.287183
## 25 40.107658
## 41 40.382987
## 10 39.632952
## 46 41.234303
## 15 39.769746
## 14 39.942091
## 36 40.957170
## 29 41.409026
## 35 41.835549
## 5 40.387035
## 6 40.493125
## 11 41.262841
## 30 42.469948
## 20 41.745383
## 19 42.264283
## 7 42.737798
## 2 43.006675
## 1 43.017547
## 23 44.699262
## 9 44.967398
## 13 44.994048
## 4 44.781340
By adjusted R-Square, it appears that the three variable model we used before fares very well compared to the full model. Attitude alone is also quite good in comparison and could be chosen by parsimony.
date()
## [1] "Thu Dec 2 10:18:29 2021"
This week, we are analyzing and modeling data on two questionnaires related to student performance and alcohol consumption. Further description is available here. Since the description is relevant for analysis we reprint it verbatim:
“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”
The dataset has been preprocessed by joining data from two courses, Math and Portugese language. Variables alc_use and high_use have been added in the wrangling phase.
Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.
rm(list=ls())
# read in the data, convert strings to factors
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
header = TRUE, sep = ",", stringsAsFactors = TRUE)
dim(alc)
## [1] 370 51
There are 370 rows and 51 columns.
str(alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
Seems like there are mostly integer and character (factorized) valued variables, as well as at least one logical (high_use) and float (alc_use).
# (Type for factor is int.)
table(sapply(alc, typeof))
##
## double integer logical
## 1 49 1
# Check for zero variance (negate characters to get "not all duplicated")
sort(
sapply(alc, function(x) {
ifelse(is.numeric(x), var(x), !all(duplicated(x)[-1L]))
}))
## n failures.p failures traveltime failures.m studytime
## 0.000000e+00 2.398667e-01 3.109939e-01 4.916502e-01 5.049513e-01 7.189922e-01
## Dalc famrel freetime alc_use school sex
## 8.032594e-01 8.304695e-01 9.712224e-01 9.958178e-01 1.000000e+00 1.000000e+00
## address famsize Pstatus Mjob Fjob reason
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## guardian schoolsup famsup activities nursery higher
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## internet romantic paid paid.p paid.m high_use
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## Medu Fedu goout age Walc health
## 1.173984e+00 1.179697e+00 1.273720e+00 1.393987e+00 1.666366e+00 1.981220e+00
## G2.p G1.p G1 G2 G3.p G3
## 6.078518e+00 6.512854e+00 7.301699e+00 7.914627e+00 8.665092e+00 1.080868e+01
## G1.m G2.m G3.m absences.p absences absences.m
## 1.119153e+01 1.444749e+01 2.124131e+01 2.330626e+01 3.029392e+01 5.876224e+01
## cid id.m id.p
## 1.143917e+04 1.274368e+04 3.041907e+04
# Seems OK, only `n` has zero variance.
Begin with the GGally::ggpairs
library(GGally)
library(ggplot2)
p <- ggpairs(alc.sub, mapping = aes(col = sex, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 3)))
p
We see that
G3 is similar for both sexes, and is not normal (by shapiro.test()), has a peak and is left skewedfamrel correlates with alc_use, but significantly only for malesgoout correlates positively with alcohol use with a high significancestudytime correlates negatively with alcohol use with a high significanceAll in all, there are several hypotheses
Let’s take a closer look on the quality of family relationship versus alcohol use
library(tidyverse)
alc_var <- function(alc, expl.variable) {
expl.vars <- c(expl.variable, "alc_use", "sex")
# Tidyverse black magic
res <-
alc %>%
select(one_of(expl.vars)) %>%
count(!!!sapply(expl.vars, as.symbol))
g <- ggplot(res, aes(x = .data[[expl.variable]], y = alc_use, col = sex)) +
geom_point(aes(size = n), alpha = 0.5) + geom_smooth(method = "loess")
return(g)
}
alc_var(alc, "famrel")
With a bit of caution, it appears than men with low quality famrel use more alcohol, as there is an increase in alc_use when famrel goes from medium to bad. However, the relationship is roughly linear (without deep analysis):
summary(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
##
## Call:
## lm(formula = alc_use ~ famrel, data = alc[alc$sex == "M", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7458 -0.9118 -0.1898 0.8102 3.0882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30184 0.38377 8.604 4.55e-15 ***
## famrel -0.27800 0.09368 -2.968 0.00343 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.128 on 173 degrees of freedom
## Multiple R-squared: 0.04844, Adjusted R-squared: 0.04294
## F-statistic: 8.807 on 1 and 173 DF, p-value: 0.003427
# make the plot tighter
op <- par(mfrow=c(2, 2),
mar = c(4, 4, 2, 1),
mgp = c(2, 1, 0))
plot(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
par(op) # return original par
library(car)
# There is no heteroscedasticity by Breusch-Pagan test
car::ncvTest(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.398195, Df = 1, p = 0.52802
Next, plot the other variables as well
library(ggpubr)
g1 <- alc_var(alc, "goout")
g2 <- alc_var(alc, "sex")
g3 <- alc_var(alc, "studytime")
g4 <- alc_var(alc, "G3")
ggarrange(g1, g2, g3, g4,
ncol = 4, nrow = 1, common.legend = TRUE,
legend = "bottom")
When goout increases, so does the alcohol use. Males appear to use more alcohol. As studytime increases, alcohol usage drops. G3 does not have a clear linear pattern with alcohol use. Otherwise, the above hypotheses are OK.
Now we are ready to fit the logistic regression model with G3, famrel, goout, sex and studytime.
model.1 <- glm(paste0("high_use", " ~ ", paste(interesting_vars[-1], collapse=" + ")),
family = "binomial", data = alc)
summary(model.1)
##
## Call:
## glm(formula = paste0("high_use", " ~ ", paste(interesting_vars[-1],
## collapse = " + ")), family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6895 -0.7784 -0.4891 0.8155 2.6807
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.85026 0.85893 -0.990 0.32222
## G3 -0.03667 0.04031 -0.910 0.36294
## famrel -0.41604 0.13989 -2.974 0.00294 **
## goout 0.77038 0.12402 6.212 5.25e-10 ***
## sexM 0.80171 0.26695 3.003 0.00267 **
## studytime -0.46034 0.17229 -2.672 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.86 on 364 degrees of freedom
## AIC: 381.86
##
## Number of Fisher Scoring iterations: 4
We got a model with four significant variables. G3 is not significant, remove it and refit.
model.2 <- glm(high_use ~ famrel + goout + sex + studytime,
family = "binomial", data = alc)
summary(model.2)
##
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5891 -0.7629 -0.4976 0.8304 2.6937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2672 0.7312 -1.733 0.08309 .
## famrel -0.4193 0.1399 -2.996 0.00273 **
## goout 0.7873 0.1230 6.399 1.57e-10 ***
## sexM 0.7959 0.2669 2.982 0.00286 **
## studytime -0.4814 0.1711 -2.814 0.00490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.69 on 365 degrees of freedom
## AIC: 380.69
##
## Number of Fisher Scoring iterations: 4
There is a lot of output
high_use ~ 1)# Residual deviance
sum(residuals(model.2, type = "deviance")^2)
## [1] 370.6854
See if residuals are roughly linear, otherwise we can’t use GLM.
# Plot residuals and check that linearity is OK
plot(model.2, which = 1) # -> Looks OK.
We can also test with ANOVA if null model is significantly different from our model
# see http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf
# and https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
# and https://stats.stackexchange.com/questions/59879/
# 1-pchisq(model.2$null.deviance-model.2$deviance, model.2$df.null-model.2$df.residual)
null <- glm(high_use ~ 1, family = "binomial", data = alc)
anova(null, model.2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: high_use ~ 1
## Model 2: high_use ~ famrel + goout + sex + studytime
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 369 452.04
## 2 365 370.69 4 81.354 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude that at least one of the explanatory variables has non-zero coefficient by this test.
OR <- coef(model.2) %>% exp
CI <- confint(model.2) %>% exp
list(OR = OR, CI = CI)
## $OR
## (Intercept) famrel goout sexM studytime
## 0.2816141 0.6575181 2.1974324 2.2165443 0.6179355
##
## $CI
## 2.5 % 97.5 %
## (Intercept) 0.06545486 1.1622100
## famrel 0.49791884 0.8636137
## goout 1.73873662 2.8198312
## sexM 1.31841735 3.7630180
## studytime 0.43751933 0.8576353
The odds ratio for famrel is 0.66 and 95% CI is [0.50, 0.86]. Thus, increase by 1 unit in famrel is associated with a decrease in the odds of high_use by 14-50%.
Variables goout (increases high_use odds) and studytime (decreases high_use odds) are interpreted analogously and neither containts 1 in the 95% CI.
The sexM is interpreted in relation to implicit sexF: being male increases the high_use odds by around 122% with 95% CI of [1.32, 3.76].
It seems that the results correspond to the hypotheses. It feels like famrel would require more work although the interaction term of sex*famrel is not significant when added to model.2 (data not shown).
observed <- alc$high_use
predicted <- predict(model.2, type = "response") > 0.5
# 2x2 tabulation, confusion matrix
table(observed, predicted)
## predicted
## observed FALSE TRUE
## FALSE 230 29
## TRUE 59 52
Most cases are classified correctly.
Below are the predictions versus actual high_use values, plotted separately for each explanatory variable.
# Get predictions, compare against true values
predicted_abs <- predict(model.2, type = "response")
df <- data.frame(cbind(observed, predicted, predicted_abs,
alc[, interesting_vars[3:6]]))
df$matches <- df$observed == df$predicted
# Plot by explanatory variables
prob.expl <- function(df, ind.var, expl.var, col.var) {
g <- ggplot(df, aes(x = .data[[ind.var]],
y = .data[[expl.var]])) +
geom_point(alpha = 0.5, aes(col = .data[[col.var]])) +
geom_smooth(method = "loess", col = "black", size = 0.5) +
theme_bw()
g
}
g1 <- prob.expl(df, "predicted_abs", "famrel", "matches")
g2 <- prob.expl(df, "predicted_abs", "goout", "matches")
g3 <- prob.expl(df, "predicted_abs", "sex", "matches")
g4 <- prob.expl(df, "predicted_abs", "studytime", "matches")
ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"),
common.legend = TRUE, legend = "bottom")
From the plot above we see where the predictions of high_useare right (greenish points) and wrong (red points). For famrel (A), sex (C) and studytime (D), the predictions seem to be false mostly around probability 0.5. With goout, high_use appears to be poorly predicted when goout is low and predicted value is in the range of [0.5, 0.7].
Quantify at which probability values mismatches tend to occur:
# Predictions equal observation
round(table(cut_interval(df[df$matches == TRUE, c("predicted_abs")], length=0.20)) /
length(df$matches), 2)
##
## [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
## 0.36 0.22 0.08 0.08 0.02
# False predictions
round(table(cut_interval(df[df$matches == FALSE, c("predicted_abs")], length=0.20)) /
length(df$matches), 2)
##
## [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
## 0.06 0.05 0.10 0.03
This confirms that most false predictions are made around the probability 0.5.
The training error of the model is 0.24, i.e. lowish. If we just guessed using, say, goout, the error would be around 0.28, which is not very bad. Of course the guessing is not very random, since we already know that goout is an important predictor.
# Training error, model.2
sum(df$matches == FALSE) / length(df$matches)
## [1] 0.2378378
# Guess high_use by goout
guesses <- rep(FALSE, times = nrow(df))
guesses[df$goout > mean(df$goout)] <- TRUE
table(df$observed, guesses)
## guesses
## FALSE TRUE
## FALSE 198 61
## TRUE 41 70
sum(guesses != df$observed) / nrow(df)
## [1] 0.2756757
Follow the code in DataCamp for this one
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, df$predicted_abs)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model.2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459
The test set performance is a bit better than the DataCamp baseline, 0.25 versus 0.26.
# Select all explanatory variables available
training.vars <- colnames(alc)[1:36] # Discard non-joined .m and .p variables
# Discard alcohol consumption variables and technical variables
training.vars <- training.vars[!(
training.vars %in% c("Dalc", "Walc", "n", "id.p", "id.m"))]
Now, with enough compute we could just try all the variables, but that would amount to 2^32 = 4294967296 models. We instead use stepwise selection by AIC starting with the full model and progressing backwards. We also assume R handles the factor variable to dummy (treatment) encoding. (Note the problems of stepwise regression. It would be advisable to use other methods e.g. through Caret.)
library(MASS)
# Run stepwise AIC selection on full model, keep the model and its AIC
final.m <- stepAIC(
# Full model
glm(paste0("high_use", " ~ ", paste(training.vars[-1], collapse=" + ")),
family = binomial, data = alc),
direction = "backward", trace = FALSE,
# What to keep from models
keep = function(model, aic) { list(model = model, aic = aic) } )
Then perform 10-fold CV for the models
# Init
CVs <- rep(NULL, ncol(final.m$keep)) # Test error
ERRs <- rep(NULL, ncol(final.m$keep)) # Train error
AICs <- rep(NULL, ncol(final.m$keep))
Nvar <- rep(NULL, ncol(final.m$keep))
for (i in 1:ncol(final.m$keep)) { # Each column is a model
interim.m <- final.m$keep[, i][1]$model
CVs[i] <- cv.glm(data = alc, cost = loss_func, glmfit = interim.m, K = 10)$delta[1]
ERRs[i] <- loss_func(alc$high_use, predict(interim.m, type = "response"))
AICs[i] <- final.m$keep[, i][2]$aic
Nvar[i] <- length(final.m$keep[, i][1]$model$coefficients) - 1
}
CV.res <- data.frame(test = CVs, train = ERRs, AIC = AICs, Nvar = Nvar)
# Somewhat unsatisfactory graph with points
g1 <- ggplot(CV.res, aes(x = Nvar, y = test)) + geom_line() +
geom_point(aes(x = Nvar, y = test, color = AIC), size = 5, alpha = 0.8) +
theme_bw() + xlab("Number of variables") + ylab("Test set error") +
ggtitle("10-fold CV on backward stepwise variable selection with AIC") +
scale_color_gradient(low="blue", high="red")
g1
# Better graph with both test and training error, and AIC separately
errors <- gather(CV.res, type, error, c("test", "train"))
g1 <- ggplot(errors, aes(x = Nvar, color = type)) + geom_line(aes(y = error)) +
theme_bw() + xlab("Number of variables") + ylab("Test set error")
g2 <- ggplot(errors, aes(x = Nvar)) + geom_line(aes(y = AIC)) +
theme_bw() + xlab("Number of variables") + ylab("AIC")
g <- ggarrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.5),
common.legend = FALSE, legend = "left")
annotate_figure(g, top = text_grob("10-fold CV on backward stepwise variable selection with AIC"))
The final model was
summary(final.m)
##
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian +
## activities + romantic + famrel + goout + health + absences,
## family = binomial, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8167 -0.7154 -0.4262 0.5635 2.7612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.14937 0.86318 -2.490 0.012772 *
## sexM 0.98275 0.28722 3.422 0.000623 ***
## addressU -0.89491 0.33661 -2.659 0.007847 **
## famsizeLE3 0.47009 0.29971 1.569 0.116759
## reasonhome 0.31183 0.35612 0.876 0.381229
## reasonother 1.24517 0.47895 2.600 0.009328 **
## reasonreputation -0.28654 0.36969 -0.775 0.438278
## guardianmother -0.68303 0.32428 -2.106 0.035179 *
## guardianother 0.41726 0.71637 0.582 0.560256
## activitiesyes -0.51481 0.28383 -1.814 0.069710 .
## romanticyes -0.50684 0.30464 -1.664 0.096171 .
## famrel -0.50627 0.15580 -3.249 0.001156 **
## goout 0.91148 0.13829 6.591 4.36e-11 ***
## health 0.16738 0.10359 1.616 0.106138
## absences 0.09152 0.02373 3.856 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 335.02 on 355 degrees of freedom
## AIC: 365.02
##
## Number of Fisher Scoring iterations: 5
Removing all non-significant explanatory variables and iterating yields a model with 0.20 prediction error and 0.22 test set error in 10-fold CV:
parsimonious.m <- glm(formula = high_use ~ sex + address + reason + guardian +
activities + famrel + goout + absences, family = binomial, data = alc)
summary(parsimonious.m)
##
## Call:
## glm(formula = high_use ~ sex + address + reason + guardian +
## activities + famrel + goout + absences, family = binomial,
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8386 -0.7558 -0.4676 0.6257 2.8133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65723 0.77828 -2.129 0.033225 *
## sexM 1.09542 0.27999 3.912 9.14e-05 ***
## addressU -0.84765 0.33097 -2.561 0.010435 *
## reasonhome 0.24841 0.34895 0.712 0.476533
## reasonother 1.06110 0.46814 2.267 0.023412 *
## reasonreputation -0.35044 0.36541 -0.959 0.337531
## guardianmother -0.65433 0.31871 -2.053 0.040071 *
## guardianother 0.27975 0.69303 0.404 0.686459
## activitiesyes -0.55234 0.28067 -1.968 0.049075 *
## famrel -0.45739 0.15011 -3.047 0.002311 **
## goout 0.87831 0.13550 6.482 9.06e-11 ***
## absences 0.08739 0.02335 3.743 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 341.90 on 358 degrees of freedom
## AIC: 365.9
##
## Number of Fisher Scoring iterations: 5
loss_func(alc$high_use, predict(parsimonious.m, type = "response"))
## [1] 0.1972973
cv.glm(data = alc, cost = loss_func, glmfit = parsimonious.m, K = 10)$delta[1]
## [1] 0.2162162
date()
## [1] "Thu Dec 2 10:19:04 2021"
This week, we are analyzing and modeling a dataset from MASS package, “Housing Values in Suburbs of Boston”. References and further description can be found here.
rm(list=ls())
library(MASS)
boston <- MASS::Boston
dim(boston)
## [1] 506 14
The data hase 506 rows and 14 columns.
str(boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Most of the columns are double valued, chas and rad are integer typed.
table(sapply(boston, typeof))
##
## double integer
## 12 2
Summary on the variables is shown below. crim is per capita crime rate by town, tax is a tax rate. chas is a dummy variable describing relative location to Charles River (1 if tract bounds river). Full descriptions are here.
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Correlation plot shows many strong correlations (without going into p-values):
library(corrplot)
library(tidyverse)
cor.matrix <- cor(boston)
rownames(cor.matrix) <- colnames(cor.matrix)
corrplot(as.matrix(cor.matrix), method = "ellipse",
order = "AOE", number.cex = 0.5, # Use some nice ordering
tl.cex = 1, tl.col = "black", addCoef.col = 'black')
We can filter and show correlations whose absolute value is above 0.7. Only show significant correlation coefficients. We see that e.g. dis and age correlate negatively, while nox and age correlate positively.
cor.matrix.filtered <- cor(boston)
cor.matrix.filtered <- apply(cor.matrix.filtered, 2, function(x) {
ifelse(abs(x) > 0.7, x, 0)
})
rownames(cor.matrix.filtered) <- colnames(cor.matrix.filtered)
# Don't show zeroes. Get the order manually to match that above.
ord <- corrMatOrder(cor.matrix, order="AOE")
cor.matrix.filtered <- cor.matrix.filtered[ord, ord]
col <- ifelse(abs(cor.matrix.filtered - 0) < .Machine$double.eps, "white", "black") #[ord, ord]
# Test for correlation significance.
testRes = cor.mtest(boston, conf.level = 0.95)
corrplot(as.matrix(cor.matrix.filtered),
method = "ellipse", order = "original", number.cex = 0.5,
tl.cex = 1, tl.col = "black",
p.mat = testRes$p, insig = "blank", na.label = " ",
use = "complete.obs", addCoef.col = col)
Plot the variables that correlate highly with some other variable
interesting <- which(
abs(cor.matrix.filtered) > 0.7 & abs(cor.matrix.filtered) < 1,
arr.ind = TRUE)
boston.cor <- boston[, which(
colnames(boston) %in% rownames(interesting))]
# pairs(boston.cor, col = boston$chas)
library(GGally)
lower <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
p
}
p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3),
lower = list(continuous = lower),
upper = list(continuous = wrap("cor", size = 2)))
p
We see that e.g. nox and age have a clear non-linear pattern (modeling with lm reveals heteroscedasticity of the residuals vs. fitted values). Other variables produce two groups of points with rad or tax.
Then plot the other ones that do not correlate strongly.
boston.cor <- boston[, which(! # not
colnames(boston) %in% c(rownames(interesting), "chas"))]
# c("nox", "dis", "tax", "indus", "age", "rad", "medv", "lstat")
# pairs(boston.cor, col = boston$chas)
lower <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
p
}
p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3),
lower = list(continuous = lower),
upper = list(continuous = wrap("cor", size = 2)))
p
Here we also see some interesting patterns, which are left for future analyses.
Then plot variable distributions, stratify by binary chas
library(reshape2)
boston$chas <- as.factor(boston$chas)
boston.m <-
boston %>%
melt(id.vars = c("chas"))
ggplot(boston.m, aes(value, col = chas)) +
geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")
Most of the distributions are not close to normal, and most observations belong to chas == 0 class. Distributions of indus, tax and rad have two peaks.
boston.std <- as.data.frame(
scale(boston[, which(!(colnames(boston) %in% c("chas")))]))
boston.std$chas <- boston$chas
summary(boston.std)
## crim zn indus nox
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-1.4644
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.9121
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.1441
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.: 0.5981
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv chas
## Min. :-1.9063 0:471
## 1st Qu.:-0.5989 1: 35
## Median :-0.1449
## Mean : 0.0000
## 3rd Qu.: 0.2683
## Max. : 2.9865
apply(boston.std, 2, var)
## crim zn indus nox rm age dis
## 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000
## rad tax ptratio black lstat medv chas
## 1.00000001 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 0.06451297
All variables except the factorized chas are mean centered, and have unit variance. Plot the distributions for clarity
boston.m <-
boston.std %>%
melt(id.vars = c("chas"))
ggplot(boston.m, aes(value, col = chas)) +
geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")
Create categorical variable of crime rate by quantiles
bins <- quantile(boston.std$crim)
crime <- cut(boston.std$crim, breaks = bins,
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
boston.std <- dplyr::select(boston.std, -crim)
boston.std$crime <- crime
Divide into train and test set (follow course’s Datacamp exercise)
# standardize the chas as we need later
boston.std$chas <- scale(as.numeric(boston.std$chas))
n <- nrow(boston.std)
set.seed(123) # for reproducibility
# sample n * 0.8 entries from 1:n
ind <- sample(n, size = n * 0.8)
train <- boston.std[ind,]
test <- boston.std[-ind,]
# save the correct classes from test data
test.correct.classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Fit LDA on all variables as instructed. Note that none of the variable distribution is normal by Shapiro-Wilk normality test
# sapply(boston.std %>% select(-chas) %>% select(-crime), shapiro.test)
# fit LDA
lda.fit <- lda(crime ~ ., data = train)
# plot biplot (follow course's Datacamp exercise)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.25)
From the biplot we see high crime rates are separated from the others well. rad, zn, (indus and nox) stand out from other variables, with rad being approximately orthogonal to the other ones. Medium high crime rates are separated pretty well from the low ones. Medium low rates overlap with low and medium high rates. There are a few outliers in all classes that are far from the center of mass of their respective group.
# print lda.fit details
# The prior probabilities seem to reflect the distribution in train data,
# but are not identical
table(train$crime) / nrow(train)
##
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus nox rm age dis
## low 1.01866313 -0.9066422 -0.8835724 0.38666334 -0.9213895 0.9094324
## med_low -0.07141687 -0.3429155 -0.5768545 -0.09882533 -0.3254849 0.3694282
## med_high -0.40148591 0.2162741 0.3637157 0.12480815 0.4564260 -0.3720478
## high -0.48724019 1.0149946 1.0481437 -0.47733231 0.8274496 -0.8601246
## rad tax ptratio black lstat medv
## low -0.6819317 -0.7458486 -0.4234280 0.3729083 -0.766883775 0.47009410
## med_low -0.5395408 -0.4205644 -0.1079710 0.3103406 -0.164921798 0.01548761
## med_high -0.4349280 -0.3191090 -0.2716959 0.1049654 0.009720124 0.19440747
## high 1.6596029 1.5294129 0.8057784 -0.6383964 0.900379309 -0.71294711
## chas
## low -0.08120770
## med_low 0.03952046
## med_high 0.19544522
## high -0.03371693
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
## chas 0.002460776 0.03963849 0.1699312
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
# predict the classes
test.predictions <- predict(lda.fit, test)
# tabulate the results
table(test.predictions$class, test.correct.classes)
## test.correct.classes
## low med_low med_high high
## low 10 2 1 0
## med_low 10 17 9 0
## med_high 4 6 13 1
## high 0 0 2 27
High class is predicted well, as was expected. The other classes are predicted less accurately, med_high is often predicted as med_low, and low as med_low. Thus with this model we can only reliably separate high from other cases.
Reload boston and standardize.
boston <- MASS::Boston
boston <- scale(boston)
# Get pairwise distances
boston.d <- dist(boston, method = "euclidean")
# Test different number of clusters
ks <- 1:15
# Determine optimal number of cluster by total within cluster
# sum of squares
# We run k-means on the scaled data and not on the distances,
# since k-means is not applicable to be run on distance matrix alone, see e.g.
# https://stackoverflow.com/questions/43512808/ and
# https://stats.stackexchange.com/questions/32925/
twcss <- sapply(ks, function(x) { kmeans(boston, centers = x)$tot.withinss})
# Plot
qplot(x = ks, y = twcss, geom = "line") + theme_bw()
Well, it is difficult to see what is the optimal number of clusters. Three or four is probably OK, two likely not large enough although that is where the largest drop in twcss occurs. Let’s run a (classical) multidimensional scaling in 2D on the distance matrix and see if there are clear clusters.
mds.fit <- cmdscale(boston.d, k = 2)
plot(mds.fit[, 1], mds.fit[, 2],
xlab = "Dimension 1", ylab = "Dimension 2")
Well, it seems like there are two clusters obtained with MDS.
Let’s run a k-medoid on the distance matrix and use the silhouette plot this time.
library(cluster)
library(factoextra)
# adapted from https://www.r-bloggers.com/2019/01/10-tips-for-choosing-the-optimal-number-of-clusters/
fviz_nbclust(as.matrix(boston.d), pam, method = "silhouette", k.max = 12) +
theme_minimal() + ggtitle("Silhoutte plot for k-medoids")
With k-medoids, we get two clusters as the optimal.
Finally, plot dendrogram
# Choose average clustering as a "generic" one
hier.clust <- hclust(boston.d, method = "average")
plot(hier.clust, cex = 0.5)
abline(h = 5.75, col = "red")
Hierarchical clustering suggests three big clusters and one or two smaller ones.
Since in k-means we noted that 3-4 clusters could be good, decrease the number to 3 based on the short analysis with MDS, k-medoids and hclust.
Let’s visualize the results
km <- kmeans(boston, centers = 3)
cols <- rep("black", nrow(boston))
cols[km$cluster == 2] <- "blue"
cols[km$cluster == 3] <- "red"
# make more compact, hide axes
pairs(boston, col = cols,
gap = 0, xaxt = "n", yaxt = "n", pch = 3)
Pairs plot does not appear to support the hypothesis of three clusters explainable by any pair of variables, although there are clearly many plots with two groups e.g. with rad variable.
# Visualize k-means result in 2D with principal components
# (from PCA, shows also the percentage of variance explained).
factoextra::fviz_cluster(km, data = boston,
#palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())
km2 <- kmeans(boston, centers = 2)
factoextra::fviz_cluster(km2, data = boston,
#palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())
Two clusters overlap less than three, but clustering into three groups is not too bad.
Run LDA on k-means clusters (without train-test split).
boston <- as.data.frame(scale(MASS::Boston))
km <- kmeans(boston, centers = 3)
boston$class <- km$cluster
# fit LDA
lda.fit.km <- lda(class ~ ., data = boston)
# plot the lda results
plot(lda.fit.km, dimen = 2, col = boston$class, pch = boston$class)
lda.arrows(lda.fit.km, myscale = 2.5)
Separation of the groups is similar to what we had before: one clearly different, and two overlapping. rad is again important variable to separate a group from the others. This time also tax is similar in direction and magnitude as rad, which we expected from correlation analysis. age and dis separate clusters 2 and 3 stronger than the other variables. They are more or less orthogonal to rad and tax.
# tabulate how k-means clusters are reproduced
table(predict(lda.fit.km)$class, km$cluster)
##
## 1 2 3
## 1 133 0 0
## 2 0 152 8
## 3 9 10 194
# Seems to be reproduced pretty well.
Reuse train set from above, follow instructions of the course
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type = "scatter3d", mode = "markers",
color = train$crime)
Plot the same with three k-means clusters
km <- kmeans(train %>% select(-crime), centers = 3)
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type = "scatter3d", mode = "markers",
color = as.factor(km$cluster))
Now it is more apparent how there might be three clusters. In both plots there is one high crime rate cluster, and the rest of data points form two clouds: a low-to-med_low cluster and a med_high_to_med_low cluster.
All in all, for prediction tasks, reliable models can be built for detecting high and non-high binary classes. Splitting non-high into two groups will decrease prediction accuracy.
date()
## [1] "Thu Dec 2 10:19:32 2021"
This week, we are analyzing and modeling a dataset that describes various aspect of human development and gender inequalities stratified by countries. Data description is found here and here.
Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.
rm(list=ls())
# read in the data, convert strings to factors
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
header = TRUE, sep = ",", stringsAsFactors = TRUE)
dim(human)
## [1] 155 8
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Life.Exp spans the range [49, 83.50], GNI [581, 123124], maternal mortality ratio (per 100000) [1.0, 1100] and adolescent birth rate (per 100000) [0.60, 204.80].
Plot the scatterplot and correlations between the variables.
library(GGally)
library(ggplot2)
lower <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
p
}
p <- ggpairs(human, mapping = aes(alpha = 0.3),
lower = list(continuous = lower),
upper = list(continuous = wrap("cor", size = 2))) +
theme(text = element_text(size = 8))
# Add color to correlations
# stackoverflow.com/questions/45873483/
# correlations matrix plot
p.corr <- ggcorr(human, label = TRUE, label_round = 3,
palette = "RdBu")
# get colors
g2 <- ggplotGrob(p.corr)
colors <- g2$grobs[[6]]$children[[3]]$gp$fill
# Change background color to tiles in the upper triangular matrix of plots
idx <- 1
for (k1 in 1:(ncol(human) - 1)) {
for (k2 in (k1 + 1):ncol(human)) {
plt <- getPlot(p, k1, k2) +
theme(panel.background = element_rect(fill = colors[idx], color="white"),
panel.grid.major = element_line(color = colors[idx]))
p <- putPlot(p, plt, k1, k2)
idx <- idx + 1
}
}
p
There are many significant strong correlations. Eg. Life.Exp correlates positively with Edu.Exp, and Mat.Mor correlates positively with Ado.Birth. Life.Exp, Edu.Exp correlate negatively with Mat.Mor. Life.Exp and Ado.Birth correlate negatively as well.
Let’s examine the correlations with corrplot also to see better the groups of positive and negative correlations. Only Labo.FM and Parli.F do not correlate strongly with anything.
library(corrplot)
cor.matrix <- cor(human)
rownames(cor.matrix) <- colnames(cor.matrix)
corrplot(as.matrix(cor.matrix), method = "ellipse",
order = "AOE", number.cex = 0.5, # Use some nice ordering
tl.cex = 1, tl.col = "black", addCoef.col = 'black')
Then plot histograms to assess the distributions.
library(tidyverse)
gather(human) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_histogram() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Edu.Exp is close to normal distribution (passes Shapiro-Wilk normality test). GNI, Mat.Mor and Ado.Birth are right skewed, Life.Exp is left skewed.
Perform PCA on non-standardized data first (follow code from course’s DataCamp)
human.pca <- prcomp(human) # SVD
s <- summary(human.pca)
# rounded percentages of variance captured by each PC
pca.pr <- round(100 * s$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc.lab <- paste0(names(pca.pr), " (", pca.pr, "%)")
# draw a biplot
biplot(human.pca, cex = c(0.6, 0.8), col = c("grey40", "darkred"),
xlab = pc.lab[1], ylab = pc.lab[2])
Figure 1. PCA on non-standardized data. GNI’s absolute variance is very large and it ends up explaining nearly all the absolute variance.
Well, since the data is not standardized, GNI appears to capture virtually all variability in the data as it has the highest absolute variance by far. Qatar stands out along the first PC and is opposite of e.g. Chad in GNI.
sort(sapply(human, var))
## Labo.FM Edu2.FM Edu.Exp Life.Exp Parli.F Ado.Birth
## 3.951293e-02 5.838970e-02 8.067024e+00 6.942328e+01 1.319683e+02 1.690201e+03
## Mat.Mor GNI
## 4.485483e+04 3.438745e+08
Let’s standardize and replot.
human.pca <- prcomp(human, scale = TRUE)
s <- summary(human.pca)
# rounded percentages of variance captured by each PC
pca.pr <- round(100 * s$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc.lab <- paste0(names(pca.pr), " (", pca.pr, "%)")
# draw a biplot
biplot(human.pca, cex = c(0.6, 0.8), col = c("grey40", "darkred"),
xlab = pc.lab[1], ylab = pc.lab[2])
Figure 2. PCA on standardized data. Life expectancy, expected years of schooling and ratio of female to male populations with secondary education correlate with the first principal component and are of approximately same magnitude. They have an opposite effect to adolescent birth rate and maternal mortality. Ratio of labour force participation of females to males and percent of parliament representation for females are orthogonal to other variables and correlate with the second principal component.
Now the two first principal components explain 69.8% of variation in the data, mainly due to PC1 (53.6%). Life.Exp, Edu.Exp and Edu2.FM all point along PC1 in the same direction, Mot.Mor and Ado.Birth point in the opposite direction, which is coherent with the correlation analysis. Parli.F and Labo.FM correlate with each other and PC2; as we calculated before, correlation is not very strong.
Third principal component explains around 9.6% of the variance and the following components less. 8 principal components are needed to fully explain the variance of the data.
summary(human.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
Based on the deductions above, we can conclude that PC1 mostly captures the variance in Life.Exp, Edu.Exp and Edu2.FM, as well as Mot.Mor and Ado.Birth. It explains 53.6% of the total variance in the standardized human data. Mot.Mor and Ado.Birth point very intuitively in the opposite direction to education, education ratio of females to males, and to life expectancy.
PC2 captures variance in Parli.F and Labo.FM, explaining 16.2% of the total variance in the data. These variables exhibit their effects in the same direction along PC2. Thus female parliament and labor representation are suggested to have similar effects.
All in all, the first two PCs are enough to approximately represent the data as the total variance explained is relatively good, 69.8%.
Load the tea dataset
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
table(sapply(tea, typeof))
##
## integer
## 36
The FactoMineR tea data has 300 rows and 36 columns, all of them except for age are factors encoded into integers. It is questionnaire data, from documentation:
“We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”
# tea %>% select(-age) %>%
tea.cat <- tea %>% select(-age)
p <-
tea %>%
select(-age) %>%
gather %>%
group_by(value) %>%
mutate(count = n()) %>%
ggplot(aes(value)) + geom_point(y = .5, aes(size = count)) +
facet_wrap("key", scales = "free", strip.position = "top", ncol = 3) +
# facet_grid("key", scales = "free", space = "free") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),
panel.spacing = unit(2, "lines")) +
coord_flip() +
theme_bw()
# https://stackoverflow.com/questions/52341385/
p
Plot age separately
ggplot(tea, aes(age)) + geom_histogram() + theme_bw()
Perform multiple correspondence analysis (with indicator matrix) on the FactoMineR tea data.
mca <- MCA(tea %>% select(-age), graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea %>% select(-age), graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.090 0.082 0.070 0.063 0.056 0.053 0.050
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465 3.272
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819 30.091
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.048 0.047 0.044 0.041 0.040 0.039 0.037
## % of var. 3.090 3.053 2.834 2.643 2.623 2.531 2.388
## Cumulative % of var. 33.181 36.234 39.068 41.711 44.334 46.865 49.252
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.035 0.034 0.032 0.031 0.031 0.030
## % of var. 2.302 2.275 2.172 2.085 2.013 2.011 1.915
## Cumulative % of var. 51.554 53.829 56.000 58.086 60.099 62.110 64.025
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.028 0.027 0.026 0.025 0.025 0.024 0.024
## % of var. 1.847 1.740 1.686 1.638 1.609 1.571 1.524
## Cumulative % of var. 65.872 67.611 69.297 70.935 72.544 74.115 75.639
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.021 0.020 0.020 0.019 0.019
## % of var. 1.459 1.425 1.378 1.322 1.281 1.241 1.222
## Cumulative % of var. 77.099 78.523 79.901 81.223 82.504 83.745 84.967
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.018 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.152 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 86.119 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.014 0.013 0.012 0.011 0.011 0.010 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666 0.660
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796 97.456
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.009 0.009 0.008 0.007 0.006
## % of var. 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052 0.013
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164 0.127
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122 0.071
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019 0.002
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002 0.000
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021 0.002
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141 0.095
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089 0.038
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341 0.553
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281 0.374
## cos2
## 1 0.001 |
## 2 0.021 |
## 3 0.009 |
## 4 0.000 |
## 5 0.000 |
## 6 0.000 |
## 7 0.015 |
## 8 0.006 |
## 9 0.080 |
## 10 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330 |
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330 |
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065 |
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065 |
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109 |
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109 |
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924 |
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924 |
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386 |
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386 |
## Dim.3 ctr cos2 v.test
## breakfast -0.107 0.225 0.011 -1.784 |
## Not.breakfast 0.099 0.208 0.011 1.784 |
## Not.tea time 0.062 0.069 0.003 0.950 |
## tea time -0.048 0.054 0.003 -0.950 |
## evening 0.344 1.653 0.062 4.301 |
## Not.evening -0.180 0.864 0.062 -4.301 |
## lunch 0.240 0.343 0.010 1.719 |
## Not.lunch -0.041 0.059 0.010 -1.719 |
## dinner 0.796 1.805 0.048 3.777 |
## Not.dinner -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
Eigenvalue output contains the variances and the respective percentage of variance explained by each dimension.
Individuals output shows the coordinates, contribution of the individual as percent, and the squared cosine which measures the degree of association between variable categories and a dimension [1]).
Categories are detailed similarly to individuals. The v.test measures if the category is significantly different from zero along a dimension (is abs(v.test) > 2, sign gives direction along the dimension). Categorical variables section shows squared correlation ratio to the dimensions.
Before going into interpretation, let’s - for pedagogical reasons - follow the example of a FactoMineR developer and redo the MCA with age as a quantitative supplementary variable and a set of others as qualitative supplementary variables.
mca <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE) # 19 = "age"
summary(mca)
##
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.148 0.122 0.090 0.078 0.074 0.071 0.068
## % of var. 9.885 8.103 6.001 5.204 4.917 4.759 4.522
## Cumulative % of var. 9.885 17.988 23.989 29.192 34.109 38.868 43.390
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.065 0.062 0.059 0.057 0.054 0.052 0.049
## % of var. 4.355 4.123 3.902 3.805 3.628 3.462 3.250
## Cumulative % of var. 47.745 51.867 55.769 59.574 63.202 66.664 69.914
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.048 0.047 0.046 0.040 0.038 0.037 0.036
## % of var. 3.221 3.127 3.037 2.683 2.541 2.438 2.378
## Cumulative % of var. 73.135 76.262 79.298 81.982 84.523 86.961 89.339
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27
## Variance 0.035 0.031 0.029 0.027 0.021 0.017
## % of var. 2.323 2.055 1.915 1.821 1.407 1.139
## Cumulative % of var. 91.662 93.717 95.633 97.454 98.861 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.541 0.658 0.143 | -0.149 0.061 0.011 | -0.306 0.347
## 2 | -0.361 0.293 0.133 | -0.078 0.017 0.006 | -0.633 1.483
## 3 | 0.073 0.012 0.003 | -0.169 0.079 0.018 | 0.246 0.224
## 4 | -0.572 0.735 0.235 | 0.018 0.001 0.000 | 0.203 0.153
## 5 | -0.253 0.144 0.079 | -0.118 0.038 0.017 | 0.006 0.000
## 6 | -0.684 1.053 0.231 | 0.032 0.003 0.001 | -0.018 0.001
## 7 | -0.111 0.027 0.022 | -0.182 0.090 0.059 | -0.207 0.159
## 8 | -0.210 0.099 0.043 | -0.068 0.013 0.004 | -0.421 0.655
## 9 | 0.118 0.031 0.012 | 0.229 0.144 0.044 | -0.538 1.070
## 10 | 0.258 0.150 0.045 | 0.478 0.627 0.156 | -0.482 0.861
## cos2
## 1 0.046 |
## 2 0.409 |
## 3 0.038 |
## 4 0.030 |
## 5 0.000 |
## 6 0.000 |
## 7 0.077 |
## 8 0.174 |
## 9 0.244 |
## 10 0.158 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.166 0.495 0.025 2.756 | -0.166 0.607 0.026 -2.764 |
## Not.breakfast | -0.153 0.457 0.025 -2.756 | 0.154 0.560 0.026 2.764 |
## Not.tea time | -0.498 4.053 0.192 -7.578 | 0.093 0.174 0.007 1.423 |
## tea time | 0.386 3.142 0.192 7.578 | -0.072 0.135 0.007 -1.423 |
## evening | 0.319 1.307 0.053 3.985 | -0.058 0.053 0.002 -0.728 |
## Not.evening | -0.167 0.683 0.053 -3.985 | 0.030 0.028 0.002 0.728 |
## lunch | 0.659 2.385 0.075 4.722 | -0.390 1.018 0.026 -2.793 |
## Not.lunch | -0.113 0.410 0.075 -4.722 | 0.067 0.175 0.026 2.793 |
## dinner | -0.661 1.146 0.033 -3.136 | 0.796 2.025 0.048 3.774 |
## Not.dinner | 0.050 0.086 0.033 3.136 | -0.060 0.152 0.048 -3.774 |
## Dim.3 ctr cos2 v.test
## breakfast -0.483 6.900 0.215 -8.017 |
## Not.breakfast 0.445 6.369 0.215 8.017 |
## Not.tea time 0.265 1.886 0.054 4.027 |
## tea time -0.205 1.462 0.054 -4.027 |
## evening 0.451 4.312 0.106 5.640 |
## Not.evening -0.236 2.254 0.106 -5.640 |
## lunch 0.301 0.822 0.016 2.160 |
## Not.lunch -0.052 0.141 0.016 -2.160 |
## dinner 0.535 1.235 0.022 2.537 |
## Not.dinner -0.040 0.093 0.022 -2.537 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.025 0.026 0.215 |
## tea.time | 0.192 0.007 0.054 |
## evening | 0.053 0.002 0.106 |
## lunch | 0.075 0.026 0.016 |
## dinner | 0.033 0.048 0.022 |
## always | 0.045 0.001 0.101 |
## home | 0.005 0.000 0.134 |
## work | 0.112 0.043 0.005 |
## tearoom | 0.372 0.022 0.008 |
## friends | 0.243 0.015 0.103 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2
## F | 0.151 0.033 3.158 | -0.109 0.017 -2.278 | -0.048 0.003
## M | -0.221 0.033 -3.158 | 0.159 0.017 2.278 | 0.070 0.003
## employee | -0.153 0.006 -1.313 | -0.151 0.006 -1.289 | 0.103 0.003
## middle | -0.030 0.000 -0.205 | 0.336 0.017 2.281 | -0.284 0.012
## non-worker | -0.036 0.000 -0.324 | 0.185 0.009 1.666 | -0.291 0.023
## other worker | 0.040 0.000 0.187 | 0.013 0.000 0.061 | -0.063 0.000
## senior | 0.415 0.023 2.608 | 0.072 0.001 0.452 | -0.187 0.005
## student | 0.032 0.000 0.305 | -0.317 0.031 -3.022 | 0.394 0.047
## workman | -0.417 0.007 -1.473 | 0.249 0.003 0.878 | 0.343 0.005
## Not.sportsman | -0.030 0.001 -0.426 | 0.018 0.000 0.260 | -0.051 0.002
## v.test
## F -0.998 |
## M 0.998 |
## employee 0.884 |
## middle -1.928 |
## non-worker -2.620 |
## other worker -0.289 |
## senior -1.177 |
## student 3.760 |
## workman 1.209 |
## Not.sportsman -0.721 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.033 0.017 0.003 |
## SPC | 0.032 0.053 0.076 |
## Sport | 0.001 0.000 0.002 |
## age_Q | 0.008 0.077 0.146 |
## frequency | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality | 0.005 0.000 0.016 |
## healthy | 0.000 0.000 0.008 |
## diuretic | 0.004 0.000 0.013 |
## friendliness | 0.071 0.001 0.013 |
##
## Supplementary continuous variable
## Dim.1 Dim.2 Dim.3
## age | 0.042 | 0.204 | -0.340 |
The output of summary contains now also supplementary categories, for which there are no contributions as they are not used to build the explanatory dimensions.
Plot first the quantitative supplementary age variable. It seems age does not correlate strongly with the first two dimensions.
plot.MCA(mca, choix = "quanti.sup")
Then plot variable graph for the correlation ratios. where, tearoom and how correlate with the first dimension more than other variables. Similarly where, price and how correlate with the second dimension more than the other variables. Active variables are in red, suppl. categorical variables in green, and suppl. continuous variable age in blue.
plot.MCA(mca, choix = "var", xlim = c(0, 0.75), ylim = c(0, 0.75))
Then plot the graph for individuals and the category values. The plot is quite busy; it appears that the values tea shop (where), unpackaged (how) and p_upscale (price) are important for dimension 2. other(How [sic]) and tearoom (tearoom), chain store+tea shop (where) are important for dimension 1. Other data points are more clustered towards the middle.
plot.MCA(mca, choix = "ind")
Plot next only individuals and for the other plot the active categories. Plot only 50 individuals and 15 category values that contribute the most to the dimensions. Color the individuals by where variable.
For individuals, there appears to be a denser region in the third quadrant. Individuals contributing the most are on the periphery of the point cloud. It seems the individuals are separated nicely by where they get their tea from (where). This is not too suprising, as that variable was found to be important in the earlier plots.
Conclusions for the active categories are as before.
op <- par(mfcol=c(1, 2),
mar = c(4, 4, 2, 1),
mgp = c(2, 1, 0))
plot(mca, invisible = c("var", "quali.sup", "quanti.sup"),
title = "Individuals", select = "contrib 15", habillage = "where")
plot(mca, invisible = c("ind", "quali.sup", "quanti.sup"),
title = "Active categories", autoLab = "yes", cex = 0.5,
selectMod = "contrib 15")
par(op) # return original par
(more chapters to be added similarly as we proceed with the course!)